Chapter 7: Linear Regression with a Single Predictor

Overview

  • 7 Linear regression with a single predictor
    • 7.1 Fitting a line, residuals, and correlation
      • 7.1.1 Fitting a line to data
      • 7.1.2 Using linear regression to predict possum head lengths
      • 7.1.3 Residuals
      • 7.1.4 Describing linear relationships with correlation
    • 7.2 Least squares regression
      • 7.2.1 Gift aid for freshmen at Elmhurst College
      • 7.2.2 An objective measure for finding the best line
      • 7.2.3 Finding and interpreting the least squares line
      • 7.2.4 Extrapolation is treacherous
      • 7.2.5 Describing the strength of a fit
      • 7.2.6 Categorical predictors with two levels
    • 7.3 Outliers in linear regression

7.1 Fitting a line, residuals, and correlation

7.1.1 Fitting a line to data

Suppose you want to buy \(x\) shares of TGT stock. Two facts determine the total cost \(y\) in dollars:

  • Your broker charges you $5 for the transaction.
  • TGT is currently selling for $64.96 per share.

\[ y = 5 + 64.96 x. \]
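The cost equation can be evaluated directly in R. This is a minimal sketch; the function name `total_cost` is made up for illustration.

```r
# Total cost (in dollars) of buying x shares at $64.96 each plus a $5 broker fee.
total_cost <- function(x) {
  5 + 64.96 * x
}

total_cost(0)                    # only the fee: 5
total_cost(10)                   # 5 + 649.60 = 654.6
total_cost(11) - total_cost(10)  # one more share costs the slope, about 64.96
```

The intercept (5) is the cost when no shares are bought, and the slope (64.96) is the extra cost of each additional share.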

Equations of Lines

A linear relationship between an explanatory variable \(x\) and a response variable \(y\) can be estimated by a regression line:

\[ \hat{y} = b_0 + b_1 x \]

  • The slope of this line is \(b_1\).
  • The y-intercept of this line is \(b_0\).

We use the symbol \(\hat{y}\) to emphasize that this is a predicted value of \(y\).

7.1.2 Using linear regression to predict possum head lengths

Researchers captured 104 brushtail possums and took body measurements before releasing the animals back into the wild. We consider two of these measurements: the total length of each possum, from head to tail, and the length of each possum’s head.

Group Exercise

The equation of the regression line for predicting Head Length from Total Length is \(\hat{y} = 41 + 0.59x\).

  1. Use the regression model to predict the Head Length of a possum whose Total Length is 76.0 cm.

  2. One of the possums in the sample had a Total Length of 76.0 cm and a Head Length of 85.1 mm. Did the model prediction overestimate or underestimate the Head Length? By how much?

7.1.3 Residuals

Residuals are the leftover variation in the data after accounting for the model fit.

  • Each observation (i.e., point) has a residual.
  • The residual of a point is the vertical distance between the point and the regression line.
    • Data = Fit + Residual
    • Residual = Data - Fit
    • In symbols, the residual of the \(i\)th observation \((x_i, y_i)\) is \(e_i = y_i - \hat{y}_i\).
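As a sketch of Residual = Data - Fit, the R code below applies the possum regression line \(\hat{y} = 41 + 0.59x\) to three hypothetical possums. The measurements are made up for illustration and are not rows from the real data; `predict_head` is a hypothetical helper name.

```r
# Regression line from the possum example:
# predicted head length (mm) from total length (cm).
predict_head <- function(total_length) {
  41 + 0.59 * total_length
}

# Hypothetical possums, made up for illustration only.
total_length <- c(80, 90, 95)        # cm
head_length  <- c(89.0, 93.0, 97.5)  # mm

fit      <- predict_head(total_length)  # predicted head lengths: 88.2, 94.1, 97.05
residual <- head_length - fit           # Residual = Data - Fit: 0.8, -1.1, 0.45

data.frame(total_length, head_length, fit, residual)
```

A positive residual means the point lies above the line (the model underestimated), and a negative residual means the point lies below the line (the model overestimated).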

Group Exercise

Everyone (especially the Operator): open the following applet:

https://www.rossmanchance.com/applets/2021/regshuffle/regshuffle.htm

Group Exercise

  1. Click the Show Regression Line checkbox and record the equation of the regression line.

  2. Click Show Residuals. Find the largest negative residual and click on its point in the scatterplot. Record the value of the residual.

  3. Delete this row by clicking the Delete button that appeared when you did part 2. How did this affect the slope of the regression line?

7.1.4 Describing linear relationships with correlation

The correlation coefficient \(r\) describes the strength and direction of a linear relationship.

  • \(r\) is a number between -1 and 1.
  • In practice, \(r\) is calculated with software rather than by hand.
  • The sign of \(r\) tells you the direction of the relationship.

| Range of \(\lvert r \rvert\) | Strength | Meaning |
|---|---|---|
| \(0.7 \leq \lvert r \rvert \leq 1\) | Strong | Points almost form a line. |
| \(0.3 \leq \lvert r \rvert < 0.7\) | Moderate | Clear pattern, but bloblike. |
| \(0.1 \leq \lvert r \rvert < 0.3\) | Weak | Slight pattern. |
| \(0 \leq \lvert r \rvert < 0.1\) | None | No discernible trend. |
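
A minimal R sketch of computing \(r\) and looking up its strength. The data are made up, and `describe_strength` is a hypothetical helper. It assumes that a value exactly on a cutoff (0.1, 0.3, 0.7) belongs to the stronger category.

```r
# Made-up data with a nearly linear, increasing pattern.
x <- c(1, 2, 3, 4, 5, 6)
y <- c(2.1, 2.9, 4.2, 4.8, 6.1, 6.9)

r <- cor(x, y)  # base R function for the correlation coefficient

# Hypothetical helper: label the strength of a single correlation value.
describe_strength <- function(r) {
  a <- abs(r)  # strength depends only on the size of r, not its sign
  if (a >= 0.7) {
    "Strong"
  } else if (a >= 0.3) {
    "Moderate"
  } else if (a >= 0.1) {
    "Weak"
  } else {
    "None"
  }
}

describe_strength(r)      # "Strong" for this nearly linear data
describe_strength(-0.54)  # "Moderate"
describe_strength(0.16)   # "Weak"
```

The sign of `r` gives the direction separately: `sign(r)` is 1 for an increasing relationship and -1 for a decreasing one.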

[Figure: scatterplot panels A to F with their correlations. A: \(r = -0.54\); B: \(r = 0.16\); C: \(r = 0.46\); D: \(r = -0.44\); E: \(r = 0.69\); F: \(r = 0.85\).]

Group Exercise

Return to the applet from the previous group exercise.

  1. Reload the applet to get it back to the original state.

  2. Check the Show Regression Line and Correlation coefficient boxes. Record the value of \(r\).

  3. Now try deleting points from the scatterplot and see if you can get the value of \(r\) to be greater than 0.85. How many points did you have to delete?

7.2 Least squares regression

7.2.1 Gift aid for freshmen at Elmhurst College

  • Explanatory variable: family income (in thousands of dollars)
  • Response variable: how much “gift aid” a student receives (in thousands of dollars)

7.2.2 An objective measure for finding the best line

  • The least squares regression line is the line that minimizes the sum of the squares of the residuals.
  • See the regression applet for a demo.
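
To illustrate the least squares criterion, the sketch below compares the sum of squared residuals for the line found by `lm()` with that of a nearby alternative line. The data are made up, and `sse` is a hypothetical helper name.

```r
# Made-up data for illustration.
x <- c(1, 2, 3, 4, 5)
y <- c(2, 4, 5, 4, 5)

# Sum of squared residuals for the line y-hat = b0 + b1 * x.
sse <- function(b0, b1) {
  sum((y - (b0 + b1 * x))^2)
}

fit <- lm(y ~ x)
b   <- unname(coef(fit))  # least squares intercept (2.2) and slope (0.6)

sse_best  <- sse(b[1], b[2])  # 2.4
sse_other <- sse(2, 0.7)      # 2.55, a nearby line that fits slightly worse

c(least_squares = sse_best, alternative = sse_other)
```

Any other choice of intercept and slope gives a sum of squared residuals at least as large as the least squares line's.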

Elmhurst College data

library(openintro)  # provides the elmhurst data set
library(dplyr)      # provides glimpse()
glimpse(elmhurst)
Rows: 50
Columns: 3
$ family_income <dbl> 92.922, 0.250, 53.092, 50.200, 137.613, 47.957, 113.534, 168.579, 208.115, 1…
$ gift_aid      <dbl> 21.720, 27.470, 27.750, 27.220, 18.000, 18.520, 13.000, 13.000, 14.000, 25.4…
$ price_paid    <dbl> 14.280, 8.530, 14.250, 8.780, 24.000, 23.480, 23.000, 29.000, 28.000, 16.530…

7.2.3 Finding and interpreting the least squares line

model1 <- lm(gift_aid ~ family_income, data = elmhurst)
summary(model1)

Call:
lm(formula = gift_aid ~ family_income, data = elmhurst)

Residuals:
     Min       1Q   Median       3Q      Max 
-10.1128  -3.6234  -0.2161   3.1587  11.5707 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)   24.31933    1.29145  18.831  < 2e-16 ***
family_income -0.04307    0.01081  -3.985 0.000229 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 4.783 on 48 degrees of freedom
Multiple R-squared:  0.2486,    Adjusted R-squared:  0.2329 
F-statistic: 15.88 on 1 and 48 DF,  p-value: 0.0002289

\[ \widehat{\texttt{gift_aid}} = 24.3 - 0.0431 \times \texttt{family_income} \]

Interpreting the coefficients

\[ \widehat{\texttt{gift_aid}} = 24.3 - 0.0431 \times \texttt{family_income} \]

  • The regression slope is the predicted change in \(y\) for a one-unit increase in \(x\).
    • Both variables are recorded in thousands of dollars, so for every additional \(\$1000\) of family_income, the predicted gift_aid decreases by about \(\$43\).
  • The regression intercept is the predicted value of \(y\) when \(x\) is zero.
    • The model predicts about \(\$24,300\) in gift aid for a family with no income.

Making predictions

\[ \widehat{\texttt{gift_aid}} = 24.3 - 0.0431 \times \texttt{family_income} \]

predict(model1, data.frame(family_income = 80))
      1 
20.8736 

The predicted gift_aid for a family_income of \(\$80,000\) is about \(\$20,874\).
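
The same prediction can be reproduced by hand from the coefficients printed in the `summary(model1)` output. This sketch does not need the data set; the variable names are chosen for illustration.

```r
# Coefficients copied from the summary(model1) output (units: $1000s).
b0 <- 24.31933   # intercept
b1 <- -0.04307   # slope for family_income

income <- 80     # a family income of $80,000

predicted_aid <- b0 + b1 * income
predicted_aid          # about 20.87 (thousand dollars)
predicted_aid * 1000   # about 20874 dollars
```

The small difference from the `predict()` result comes from the rounding of the printed coefficients.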

7.2.4 Extrapolation is treacherous

\[ \widehat{\texttt{gift_aid}} = 24.3 - 0.0431 \times \texttt{family_income} \]

predict(model1, data.frame(family_income = 3000))
        1 
-104.8956 

The predicted gift_aid for a family_income of \(\$3,000,000\) is about \(-\$104,900\), which is impossible: gift aid cannot be negative.

  • Extrapolation refers to making predictions outside of the range of the observed data.
  • Extrapolation should be avoided: the data give no evidence that the linear trend continues beyond the observed range.
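
One way to guard against extrapolation is to check a new \(x\) value against the observed range before predicting. The sketch below uses only the nine family incomes visible in the `glimpse()` output, as a partial stand-in for the full 50-row column `elmhurst$family_income`. `is_extrapolation` is a hypothetical helper, not part of any package.

```r
# First nine family incomes (in $1000s) shown in the glimpse() output.
# The full data set has 50 rows, so this is only a partial stand-in.
observed_income <- c(92.922, 0.250, 53.092, 50.200, 137.613,
                     47.957, 113.534, 168.579, 208.115)

# Hypothetical helper: TRUE when x_new falls outside the observed range of x.
is_extrapolation <- function(x_new, x_obs) {
  x_new < min(x_obs) | x_new > max(x_obs)
}

is_extrapolation(80, observed_income)    # FALSE: inside the observed range
is_extrapolation(3000, observed_income)  # TRUE: far beyond the largest income
```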

7.2.5 Describing the strength of a fit

  • The square of the correlation coefficient, \(r^2\), is the proportion of variation in \(y\) that is explained by the linear relationship with \(x\).
    • 24.86% of the variation in gift_aid can be explained by the linear relationship with family_income.
    • \(r^2\) is called the coefficient of determination.
    • \(r^2\) is often denoted \(R^2\).
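
A short check, on made-up data, that the square of the correlation coefficient matches the R-squared reported by `lm()` and the proportion of variation explained.

```r
# Made-up data for illustration.
x <- c(1, 2, 3, 4, 5)
y <- c(2, 4, 5, 4, 5)

fit <- lm(y ~ x)

r_squared_from_r  <- cor(x, y)^2            # square of the correlation
r_squared_from_lm <- summary(fit)$r.squared # "Multiple R-squared" in the summary

# Proportion of variation explained, computed from its definition.
sse <- sum(residuals(fit)^2)   # variation left over in the residuals
sst <- sum((y - mean(y))^2)    # total variation in y
r_squared_from_def <- 1 - sse / sst

# All three are about 0.6 for this data.
c(r_squared_from_r, r_squared_from_lm, r_squared_from_def)
```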

Formulas

\[ r= \frac{1}{n-1} \sum_{i=1}^n \left( \frac{x_i - \bar{x}}{s_x} \right) \left( \frac{y_i -\bar{y}}{s_y} \right) \]

\[ \begin{aligned} \hat{y} &= b_0 + b_1 x && \text{regression line} \\ b_1 &= r \frac{s_y}{s_x} && \text{sample slope} \\ b_0 &= \bar{y} - b_1\bar{x} && \text{sample $y$-intercept} \end{aligned} \]
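
The formulas above can be checked numerically. This sketch computes \(r\), the slope, and the intercept by hand on made-up data and compares them with `cor()` and `lm()`. The variable names are chosen for illustration.

```r
# Made-up data for illustration.
x <- c(1, 2, 3, 4, 5)
y <- c(2, 4, 5, 4, 5)
n <- length(x)

# Correlation: average product of standardized values (dividing by n - 1).
zx <- (x - mean(x)) / sd(x)
zy <- (y - mean(y)) / sd(y)
r  <- sum(zx * zy) / (n - 1)

# Slope and intercept from r, the standard deviations, and the means.
slope     <- r * sd(y) / sd(x)
intercept <- mean(y) - slope * mean(x)

c(intercept = intercept, slope = slope)  # about 2.2 and 0.6
coef(lm(y ~ x))                          # same values from lm()
```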

7.3 Outliers in linear regression

The regression line is not robust

  • Outliers can have a big effect on the regression line.
  • Observations far from the other points in the \(x\) direction have high leverage; such a point is influential if it pulls the regression line toward itself.
  • See the regression applet.
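
A minimal sketch of this effect on made-up data: the same unusual \(y\) value is added once at the mean of \(x\) and once far out in the \(x\) direction. `slope_of` is a hypothetical helper name.

```r
# Made-up data for illustration.
x <- c(1, 2, 3, 4, 5)
y <- c(2, 4, 5, 4, 5)

# Hypothetical helper: least squares slope for the given data.
slope_of <- function(x, y) {
  unname(coef(lm(y ~ x))[2])
}

slope_original <- slope_of(x, y)                # no added point
slope_middle   <- slope_of(c(x, 3), c(y, 0))    # outlier at the mean of x
slope_extreme  <- slope_of(c(x, 15), c(y, 0))   # outlier far to the right

# Roughly 0.60, 0.60, and -0.26.
c(original = slope_original, middle = slope_middle, extreme = slope_extreme)
```

The outlier at the mean of \(x\) leaves the slope unchanged (it only shifts the line down), while the outlier far to the right reverses the sign of the slope.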

Group Exercise

  1. Reload the regression applet.

  2. Click on Show Data Options and Show Regression Line.

  3. Add the point \((38, 75)\). Did the regression line change?

  4. Add the point \((38,36)\). Did the regression line change?

  5. Now click on Move Observations. Try moving that last observation around, and observe what happens to the regression line.

  6. Discuss: What types of observations are most influential on the regression line?